This notebook presents an exploratory data analysis of the subset of the Vancouver Street Trees dataset located here.
# Import libraries needed for this assignment
import altair as alt
import pandas as pd
import os
# alt.data_transformers.enable("data_server")
Let's import the subset of the Vancouver Street Trees data. Since this is a new dataset, a good first step is to get familiar with it by glancing at the first few rows of the dataframe.
trees_df = pd.read_csv('small_unique_vancouver.csv')
trees_df.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaN | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaN | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
Next, let's check the type of data in each column and how many missing values there are.
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Unnamed: 0          5000 non-null   int64
 1   std_street          5000 non-null   object
 2   on_street           5000 non-null   object
 3   species_name        5000 non-null   object
 4   neighbourhood_name  5000 non-null   object
 5   date_planted        2363 non-null   object
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object
 8   genus_name          5000 non-null   object
 9   assigned            5000 non-null   object
 10  civic_number        5000 non-null   int64
 11  plant_area          4950 non-null   object
 12  curb                5000 non-null   object
 13  tree_id             5000 non-null   int64
 14  common_name         5000 non-null   object
 15  height_range_id     5000 non-null   int64
 16  on_street_block     5000 non-null   int64
 17  cultivar_name       2658 non-null   object
 18  root_barrier        5000 non-null   object
 19  latitude            5000 non-null   float64
 20  longitude           5000 non-null   float64
dtypes: float64(3), int64(5), object(13)
memory usage: 820.4+ KB
From the information above, date_planted has dtype object, so the dates were read as plain strings. To work with them as dates, we can pass parse_dates=['date_planted'] to read_csv and load the file again.
Also, three of the columns contain NaNs. date_planted and cultivar_name have the most: about half of their rows are missing a value.
Now we parse the dates and reprint the info of the dataset.
trees_df = pd.read_csv('small_unique_vancouver.csv',parse_dates=['date_planted'])
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Unnamed: 0          5000 non-null   int64
 1   std_street          5000 non-null   object
 2   on_street           5000 non-null   object
 3   species_name        5000 non-null   object
 4   neighbourhood_name  5000 non-null   object
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object
 8   genus_name          5000 non-null   object
 9   assigned            5000 non-null   object
 10  civic_number        5000 non-null   int64
 11  plant_area          4950 non-null   object
 12  curb                5000 non-null   object
 13  tree_id             5000 non-null   int64
 14  common_name         5000 non-null   object
 15  height_range_id     5000 non-null   int64
 16  on_street_block     5000 non-null   int64
 17  cultivar_name       2658 non-null   object
 18  root_barrier        5000 non-null   object
 19  latitude            5000 non-null   float64
 20  longitude           5000 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
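To put numbers on the missing values seen in the info output, a minimal sketch along the following lines could compute the share of NaNs per column. The helper name `missing_share` and the toy frame are illustrative assumptions and not part of the original notebook; in the notebook the function would be called on `trees_df`.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "date_planted": ["2000-02-23", None, None, "1999-11-12"],
    "cultivar_name": [None, "CHANTICLEER", None, "AUTUMN APPLAUSE"],
    "diameter": [28.5, 6.0, 12.0, 11.0],
})
shares = missing_share(toy)
print(shares)
```

Because `isna()` yields booleans, taking the column mean gives the missing fraction directly, which is easier to compare across columns than raw non-null counts.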
Visualizing missing values helps us identify potential issues with the data.
alt.data_transformers.disable_max_rows();
trees_nans = trees_df.isna().reset_index().melt(id_vars='index', var_name='column', value_name='NaN')
trees_nans
| | index | column | NaN |
|---|---|---|---|
| 0 | 0 | Unnamed: 0 | False |
| 1 | 1 | Unnamed: 0 | False |
| 2 | 2 | Unnamed: 0 | False |
| 3 | 3 | Unnamed: 0 | False |
| 4 | 4 | Unnamed: 0 | False |
| ... | ... | ... | ... |
| 104995 | 4995 | longitude | False |
| 104996 | 4996 | longitude | False |
| 104997 | 4997 | longitude | False |
| 104998 | 4998 | longitude | False |
| 104999 | 4999 | longitude | False |
105000 rows × 3 columns
alt.Chart(trees_nans).mark_rect(height=17).encode(
x='index:O',
y='column',
color='NaN',
stroke='NaN').properties(width=900)
By visualizing the missing values for each column next to each other, we can quickly see whether columns share similar patterns. The plot above shows that the missing values in cultivar_name and date_planted do not fall on exactly the same rows, although both columns are missing a value in about half of the rows. The column plant_area is missing a value in only 1% of the rows.
Since cultivar_name and plant_area are categorical columns that describe the trees, we do not need to drop their NaN values as long as we are not analysing those columns. For date_planted, we can drop the NaN values when we focus on time-related statistics. Because almost half of the rows are missing date_planted, we keep those rows rather than drop them when the statistics do not involve time.
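Whether two columns are missing values on the same rows can also be checked numerically. The sketch below cross-tabulates the missing indicators of two columns; the helper name `missing_overlap` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def missing_overlap(df: pd.DataFrame, col_a: str, col_b: str) -> pd.DataFrame:
    """Cross-tabulate missing/present status of col_a (rows) against col_b (columns)."""
    labels = {True: "missing", False: "present"}
    return pd.crosstab(
        df[col_a].isna().map(labels).rename(col_a),
        df[col_b].isna().map(labels).rename(col_b),
    )


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "date_planted": ["2000-02-23", None, None, "1999-11-12", None],
    "cultivar_name": [None, "CHANTICLEER", None, "AUTUMN APPLAUSE", None],
})
overlap = missing_overlap(toy, "date_planted", "cultivar_name")
print(overlap)
```

If the two columns were always missing together, the off-diagonal cells ("missing" in one column, "present" in the other) would be zero.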
Now let’s print out the summary statistics for the numerical columns.
trees_df.describe()
| | Unnamed: 0 | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 14861.920400 | 12.340888 | 2975.707600 | 128682.584600 | 2.73440 | 2960.227000 | 49.247349 | -123.107128 |
| std | 8680.023278 | 9.266600 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
| min | 2.000000 | 0.000000 | 2.000000 | 36.000000 | 0.00000 | 0.000000 | 49.202783 | -123.220560 |
| 25% | 7192.750000 | 4.000000 | 1300.500000 | 61321.500000 | 2.00000 | 1300.000000 | 49.230152 | -123.144178 |
| 50% | 14870.000000 | 10.000000 | 2639.000000 | 130130.500000 | 2.00000 | 2600.000000 | 49.247981 | -123.105861 |
| 75% | 22366.750000 | 18.000000 | 4123.000000 | 191332.000000 | 4.00000 | 4100.000000 | 49.263275 | -123.063484 |
| max | 29992.000000 | 71.000000 | 9113.000000 | 270750.000000 | 9.00000 | 9100.000000 | 49.293930 | -123.023311 |
Visualizing the distributions of all numerical columns helps us understand the data.
The first column, Unnamed: 0, appears to be the row id from the original dataset, so it is of little interest when exploring relationships between numerical columns. We ignore this column in the numerical exploration that follows.
# Remove the first column (Unnamed: 0) from the numerical columns
numerical_columns = trees_df.iloc[:,1:].select_dtypes('number').columns.tolist()
#numerical_columns =
(alt.Chart(trees_df)
.mark_bar().encode(
alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=25)),
y='count()')
.properties(width=220, height=150)
.repeat(numerical_columns,columns=3))
This overview tells us that the most common diameter is under 5 inches and the most common height range id is between 1 and 2. As trees get bigger and taller, the counts go down. Also, the civic number and the street block number seem to share the same distribution.
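The matching distributions of civic_number and on_street_block would be expected if the block number were simply the civic number rounded down to the nearest hundred. That relationship is an assumption suggested by the summary statistics, not something the notebook confirms, and the helper name below is hypothetical. A rough check could look like this:

```python
import pandas as pd


def share_block_matches_civic(df: pd.DataFrame) -> float:
    """Fraction of rows where on_street_block equals civic_number floored to the hundred."""
    floored = (df["civic_number"] // 100) * 100
    return float((floored == df["on_street_block"]).mean())


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "civic_number": [2639, 4123, 9113, 250],
    "on_street_block": [2600, 4100, 9100, 0],
})
share = share_block_matches_civic(toy)
print(share)
```

A share close to 1 on the real data would support the assumption; a lower share would mean the two columns are related in some other way.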
Repeating columns of both X and Y lets us effectively explore pairwise relationships between columns.
# Scroll right on the plot to see the last column
(alt.Chart(trees_df)
.mark_point(size=10).encode(
alt.X(alt.repeat('column'), type='quantitative'),
alt.Y(alt.repeat('row'), type='quantitative'))
.properties(width=80, height=120)
.repeat(column=numerical_columns, row=numerical_columns))
Unfortunately, these plots are saturated by overlapping points, so although we can see that there might be some correlations, we should remake this plot as a 2D histogram heatmap.
# Scroll right on the plot to see more columns
(alt.Chart(trees_df)
.mark_rect().encode(
alt.X(alt.repeat('column'), type='quantitative', bin=alt.Bin(maxbins=30)),
alt.Y(alt.repeat('row'), type='quantitative', bin=alt.Bin(maxbins=30)),
alt.Color('count()', title=None))
.properties(width=110, height=110)
.repeat(column=numerical_columns, row=numerical_columns)).resolve_scale(color='independent')
From the heatmaps above, we find that diameter and height range might have a positive relationship when the diameter is less than 25 inches. We can also see that the civic number and block number are related to longitude and latitude, which hints at some interesting geographic patterns.
Visualizing the counts of the categorical columns also helps us understand the data. Because some columns have too many distinct values, we select only a subset of the categorical columns to explore.
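The possible positive relationship between diameter and height range can be given a rough number with a correlation coefficient. The sketch below uses the Pearson correlation on trees under a diameter cutoff; the helper name, the choice of Pearson and the toy data are assumptions made for illustration.

```python
import pandas as pd


def diameter_height_corr(df: pd.DataFrame, max_diameter: float = 25.0) -> float:
    """Pearson correlation of diameter and height_range_id for trees under max_diameter."""
    subset = df.loc[df["diameter"] < max_diameter, ["diameter", "height_range_id"]]
    return float(subset.corr().loc["diameter", "height_range_id"])


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "diameter": [3.0, 6.0, 12.0, 18.0, 24.0, 40.0],
    "height_range_id": [1, 2, 2, 3, 4, 2],
})
r = diameter_height_corr(toy)
print(round(r, 3))
```

Since height_range_id is an ordinal code rather than a measured height, a rank-based correlation may be a better fit on the real data; Pearson is used here only to keep the sketch short.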
categorical_columns = ['street_side_name','curb','neighbourhood_name','root_barrier']
# categorical_columns = trees_df.select_dtypes('object').columns.tolist()
(alt.Chart(trees_df)
.mark_bar().encode(
alt.X('count()'),
alt.Y(alt.repeat(), type='nominal', sort='x',title=''))
.properties(width=80, height=200)
.repeat(categorical_columns))
Some of these distributions are interesting, such as how trees are planted across different street sides and neighbourhoods. We explore these aspects further in the following exploratory visualizations.
trees_df = trees_df.assign(year_planted=(trees_df['date_planted'].dt.year.astype('Int64')))
trees_with_date_df = trees_df[trees_df['date_planted'].notna()]
trees_with_date_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2363 entries, 0 to 4998
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Unnamed: 0          2363 non-null   int64
 1   std_street          2363 non-null   object
 2   on_street           2363 non-null   object
 3   species_name        2363 non-null   object
 4   neighbourhood_name  2363 non-null   object
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            2363 non-null   float64
 7   street_side_name    2363 non-null   object
 8   genus_name          2363 non-null   object
 9   assigned            2363 non-null   object
 10  civic_number        2363 non-null   int64
 11  plant_area          2328 non-null   object
 12  curb                2363 non-null   object
 13  tree_id             2363 non-null   int64
 14  common_name         2363 non-null   object
 15  height_range_id     2363 non-null   int64
 16  on_street_block     2363 non-null   int64
 17  cultivar_name       1678 non-null   object
 18  root_barrier        2363 non-null   object
 19  latitude            2363 non-null   float64
 20  longitude           2363 non-null   float64
 21  year_planted        2363 non-null   Int64
dtypes: Int64(1), datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 426.9+ KB
alt.Chart(trees_with_date_df).mark_bar().encode(
alt.X('year_planted'),
alt.Y('count()')
).properties(width=500,height=200)
From the plot above, we can see that the most trees were planted in 1996, 2002 and 2013. Next we look at how plantings in different neighbourhoods varied over these years.
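The peak years read off the bar chart can be cross-checked with a simple count. This is a minimal sketch in which the helper name `top_planting_years` and the toy dates are hypothetical; in the notebook the function would be applied to `trees_df`.

```python
import pandas as pd


def top_planting_years(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Count trees per planting year and return the n busiest years."""
    years = pd.to_datetime(df["date_planted"]).dt.year.dropna().astype(int)
    return years.value_counts().nlargest(n)


# Toy data standing in for trees_df (dates are made up for illustration).
toy = pd.DataFrame({
    "date_planted": [
        "1996-03-01", "1996-11-20", "2002-02-14", None,
        "2013-04-05", "2002-10-30", "1996-01-09",
    ],
})
top = top_planting_years(toy, n=2)
print(top)
```

Rows without a planting date are dropped before counting, which mirrors the use of `trees_with_date_df` for the chart.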
alt.Chart(trees_with_date_df).mark_rect().encode(
alt.X('year_planted',bin=alt.Bin(maxbins=25)),
alt.Y('neighbourhood_name'),
alt.Color('count()')).properties(width=510, height=410)
From the heatmap above, we learn that most trees were planted in Hastings-Sunrise, Kensington-Cedar Cottage, Renfrew-Collingwood, Sunset and Victoria-Fraserview from 1992 to 2002.
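The same reading of the heatmap can be obtained as a table by counting plantings per neighbourhood within a range of years. The helper name and toy frame below are assumptions for illustration; the sketch expects a `year_planted` column like the one created earlier with `assign`.

```python
import pandas as pd


def plantings_by_neighbourhood(df: pd.DataFrame, start: int, end: int) -> pd.Series:
    """Count trees per neighbourhood planted between start and end (inclusive)."""
    dated = df.dropna(subset=["year_planted"])
    in_range = dated["year_planted"].between(start, end)
    return dated.loc[in_range, "neighbourhood_name"].value_counts()


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "neighbourhood_name": ["Sunset", "Sunset", "Killarney", "Fairview", "Killarney"],
    "year_planted": pd.array([1995, 2001, None, 2010, 1993], dtype="Int64"),
})
counts = plantings_by_neighbourhood(toy, 1992, 2002)
print(counts)
```

`value_counts` sorts in descending order, so the neighbourhoods with the most plantings in the chosen period appear first.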
alt.Chart(trees_df).mark_bar().encode(
alt.X('count()'),
alt.Y('neighbourhood_name',sort='x')
)
We find that the neighbourhoods that planted the most trees from 1992 to 2002 are also the areas with the most trees today.
In addition, we would like to make some observations about how tree heights are distributed over the years, as a bonus to question 1.
# tree_size = ['height_range_id','diameter']
line = alt.Chart(trees_with_date_df).mark_line().encode(
alt.X('year_planted'),
alt.Y('mean(height_range_id)')
).properties(width=500,height=200)
point = alt.Chart(trees_with_date_df).mark_point().encode(
alt.X('year_planted'),
alt.Y('mean(height_range_id)')
).properties(width=500,height=200)
line + point
We find that trees planted in 1991 have the highest mean height range today, so they either grew fastest or were the tallest when planted, which is really interesting. We might find out more about this in later exploration.
To answer this question, we'll explore the relationship between average tree size and location, first by street side and then by neighbourhood.
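Before reading too much into a single year, it may help to look at the mean height range next to the number of trees behind it, since a mean over only a few trees is noisy. The sketch below is one way to do that; the helper name `height_by_year` and the toy data are hypothetical.

```python
import pandas as pd


def height_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Mean height_range_id and number of trees for each planting year."""
    dated = df.dropna(subset=["year_planted"])
    return dated.groupby("year_planted")["height_range_id"].agg(["mean", "count"])


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "year_planted": pd.array([1991, 1991, 1996, 1996, 2013, None], dtype="Int64"),
    "height_range_id": [5, 4, 3, 2, 1, 4],
})
summary = height_by_year(toy)
tallest_year = summary["mean"].idxmax()
print(summary)
print(tallest_year)
```

On the real data, a high mean paired with a small count for 1991 would suggest that the peak reflects a handful of trees rather than a general pattern.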
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(diameter)'),
alt.Y('street_side_name'),
alt.Color('street_side_name')
).properties(width=500,height=200)
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(height_range_id)'),
alt.Y('street_side_name'),
alt.Color('street_side_name')
).properties(width=500,height=200)
We find that trees planted on either side of the street are bigger and taller than those planted in the middle of the street. Trees in the bike area are especially small. This matches what we usually see when walking along the streets.
Now we explore which neighbourhoods have the most abundant giant, tall trees.
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(diameter)'),
alt.Y('neighbourhood_name',sort='x')
).properties(width=600,height=300)
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(height_range_id)'),
alt.Y('neighbourhood_name',sort='x')
).properties(width=600,height=300)
From the plots above, we find that Kitsilano, Dunbar-Southlands, Fairview, Shaughnessy and Kerrisdale are the neighbourhoods with the biggest and tallest trees on average. It is fascinating that these neighbourhoods are all on the west side of Vancouver and usually have the highest housing prices as well.
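The ranking read from the two bar charts can be reproduced with a groupby, which also makes it easy to see which neighbourhoods rank highly on both measures. The helper name `top_neighbourhoods` and the toy frame are assumptions for illustration only.

```python
import pandas as pd


def top_neighbourhoods(df: pd.DataFrame, column: str, n: int = 5) -> pd.Series:
    """Mean of `column` per neighbourhood, keeping the n largest means."""
    return df.groupby("neighbourhood_name")[column].mean().nlargest(n)


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Sunset", "Sunset", "Fairview"],
    "diameter": [20.0, 30.0, 5.0, 7.0, 18.0],
    "height_range_id": [4, 5, 1, 2, 4],
})
by_diameter = top_neighbourhoods(toy, "diameter", n=2)
by_height = top_neighbourhoods(toy, "height_range_id", n=2)
# Neighbourhoods that rank in the top n on both measures.
in_both = sorted(set(by_diameter.index) & set(by_height.index))
print(in_both)
```

Using the resulting index to build the `isin` filter would avoid typing the neighbourhood names by hand.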
Now let's take a look at how the trees are distributed in these top neighbourhoods by subplots.
top_neighbourhood_trees_df = trees_df[trees_df['neighbourhood_name'].isin(['Kitsilano', 'Dunbar-Southlands','Fairview','Shaughnessy','Kerrisdale'])]
alt.Chart(top_neighbourhood_trees_df).mark_bar().encode(
alt.X('diameter', bin=alt.Bin(maxbins=30)),
alt.Y('count()'),
alt.Color('neighbourhood_name')
).properties(width=200, height=150
).facet('neighbourhood_name',columns=3)
alt.Chart(top_neighbourhood_trees_df).mark_bar().encode(
alt.X('height_range_id', bin=alt.Bin(maxbins=30)),
alt.Y('count()'),
alt.Color('neighbourhood_name')
).properties(width=200, height=150
).facet('neighbourhood_name',columns=3)
These subplots show that Fairview has the most evenly distributed tree sizes, just as its name "Fairview" suggests! What a fun fact!
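How evenly tree sizes are spread within a neighbourhood can be given a rough score. The sketch below uses a normalized Shannon entropy of the height range counts, which is a technique swapped in here for illustration and not something the notebook itself uses; the helper name `evenness` and the toy data are hypothetical.

```python
import math

import pandas as pd


def evenness(values: pd.Series, n_categories: int) -> float:
    """Normalized Shannon entropy of value shares; 1.0 means perfectly even.

    n_categories is the number of possible levels and must be at least 2.
    """
    shares = values.value_counts(normalize=True)
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(n_categories)


# Toy data standing in for top_neighbourhood_trees_df (values are made up).
toy = pd.DataFrame({
    "neighbourhood_name": ["Fairview"] * 4 + ["Kitsilano"] * 4,
    "height_range_id": [1, 2, 3, 4, 4, 4, 4, 3],
})
n_levels = toy["height_range_id"].nunique()
scores = (
    toy.groupby("neighbourhood_name")["height_range_id"]
    .apply(evenness, n_categories=n_levels)
    .sort_values(ascending=False)
)
print(scores)
```

A score near 1 means the trees are spread almost equally over the height ranges, while a score near 0 means one height range dominates.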
alt.Chart(trees_df).mark_circle(size=500).encode(
alt.X('height_range_id', type='quantitative', bin=alt.Bin(maxbins=30)),
alt.Y('diameter', type='quantitative', bin=alt.Bin(maxbins=30)),
alt.Color('count()', title=None),alt.Size('count()',title=None)).properties(width=510, height=310)
Using both colour and marker size to indicate the count creates an effective visualization in the plot above. We can easily see that a diameter of less than 5 and a height range between 1 and 1.5 is the most popular tree size in Vancouver. Trees with a diameter between 5 and 10 and a height range between 2 and 2.5 come second.
Building on the exploratory visualizations above, we will keep exploring and focus on fun facts about tree distributions in the report. Some of these are inspired by the quick and dirty EDA plots in the introduction. The columns of interest are date_planted, neighbourhood_name, diameter, height_range_id and street_side_name.
During the exploration of the data, we found some interesting aspects that relate to people's impressions of the city of Vancouver, such as prestigious communities with more giant trees versus newly developing communities with more recently planted trees. We also explored other fun facts, such as a tree distribution that fits its neighbourhood name perfectly, as in "Fairview".
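The most common size combination can also be tabulated directly. The sketch below bins the diameter into fixed-width bins and counts trees per combination of height range and diameter bin; the bin width of 5, the helper name and the toy data are assumptions, and Altair's `maxbins` may choose different bin edges.

```python
import pandas as pd


def size_combinations(df: pd.DataFrame, bin_width: float = 5.0) -> pd.Series:
    """Count trees per (height_range_id, diameter bin start), most common first."""
    bin_start = ((df["diameter"] // bin_width) * bin_width).rename("diameter_bin_start")
    counts = df.groupby([df["height_range_id"], bin_start]).size()
    return counts.sort_values(ascending=False)


# Toy data standing in for trees_df (values are made up for illustration).
toy = pd.DataFrame({
    "diameter": [3.0, 4.0, 2.5, 7.0, 8.0, 12.0],
    "height_range_id": [1, 1, 1, 2, 2, 3],
})
combos = size_combinations(toy)
print(combos)
```

The first entries of the result correspond to the largest and darkest circles in the plot.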
Here are the five key types of graphs:
From a heatmap, we learn that most trees were planted in Hastings-Sunrise, Kensington-Cedar Cottage and other neighbourhoods in the east of Vancouver from 1992 to 2002.
Through simple bar plots, we can contrast the tree size distributions among different street sides in Vancouver.
First we use simple bar plots to find the top neighbourhoods with the most giant and tall trees. Coincidentally, they are all located in the west of Vancouver. Then, using histogram subplots faceted by these top neighbourhoods, we find another fun fact about the tree distribution.
Using both colour and marker size to indicate the count creates an effective visualization in the circle plot. It makes it easy to find the most popular range of tree sizes in Vancouver.
Exploring the first question opened the door to something more interesting. Using a line and point plot, we can easily see that trees planted in 1991 are either growing fastest or were originally the tallest, because they are the tallest trees today.